Synthesising speech by converting phonemes to digital waveforms.



This invention relates to synthetic speech and more 
particularly to a method of synthesising a digital waveform 
from signals representing phonemes. 

There are many circumstances, e.g. in telephone
systems, where it is convenient to use synthesised speech.
In some applications the starting point is an electronic
representation of conventional typography, e.g. a disk
produced by a word processor. Many stages of processing are
needed to produce synthesised speech from such a starting
point but, as a preliminary part of the processing, it is
usual to convert the conventional text into a phonetic text.
In this specification the signals representing such a
phonetic text will be called "phonemes". Thus this invention
addresses the problem of converting the signals representing
phonemes into a digital waveform. It will be appreciated
that digital waveforms are commonplace in audio
technology and that digital-to-analogue converters and
loudspeakers are well known devices which enable digital
waveforms to be converted into acoustic waveforms.

Many processes for converting phonemes into digital 
waveforms have been proposed and it is conventional to do 
this by means of a linked database comprising a large number 
of entries, each having an access portion defined in phonemes 
and an output portion containing the digital waveform 
corresponding to the access phonemes. Clearly all the 
phonemes should be represented in the access portions but it 
is also known to incorporate strings of phonemes in addition. 
However, existing systems only take into account the phoneme 
strings contained in the access portions and do not further 
take into account the context of the strings. 

This invention, which is defined in the claims, uses a
linked database to convert strings of phonemes into a digital
waveform but it also takes into account the context of the
selected phoneme strings. The invention also comprises a
novel form of database which facilitates the taking into
account of the context; and the invention also includes the
method whereby the preferred database strings are selected
from alternatives stored therein.

A preferred embodiment of the invention will now be
described by way of example.

GENERAL DESCRIPTION

This general description is intended to identify some
of the important integers of a preferred embodiment of the
invention. Each of these integers will be described in
greater detail after this general description.

The method of the invention converts input signals 
representing a text expressed in phonemes into a digital 
waveform which is ultimately converted into an acoustic wave. 
Before its conversion, the initial digital waveform may be 
further processed in accordance with methods which will be 
familiar to persons skilled in the art. 

The phoneme set used in the preferred embodiment
conforms to the SAM-PA (Speech Assessment Methodologies -
Phonetic Alphabet) simple set number 6. It is to be
understood that the method of the invention is carried out in 
electronic equipment and the phonemes are provided in the 
form of signals so that the method corresponds to the 
converting of an input waveform into an output waveform. 

The preferred embodiment of the invention converts
waveforms representing strings of one, two or three phonemes
into a digital waveform but it always operates on strings of
five phonemes so that at least one preceding and at least one
following phoneme is taken into account. This has the effect
that, when alternative strings of five phonemes are
available, the "best" context is selected.

As just explained, this invention makes particular use
of a string of five phonemes. This string will hereinafter
be called a "context window" and the five phonemes which
constitute the context window will be identified as P1, P2,
P3, P4 and P5 in sequence.



It is a key feature of this invention that a "data context
window", being five consecutive phonemes from the input signal,
is matched with an "access context window", being a sequence
of five consecutive phonemes contained in the database.
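The formation of data context windows can be sketched as follows. This is only an illustrative sketch in Python: the function name `context_windows` and the padding symbol `"_"` (standing for the silence that bounds a segment) are assumptions and do not appear in the specification. Each input phoneme becomes P3 of its own window, with its two predecessors and two successors supplying P1, P2, P4 and P5.

```python
from typing import List, Tuple

SILENCE = "_"  # hypothetical padding symbol; the specification does not name one


def context_windows(phonemes: List[str], pad: str = SILENCE) -> List[Tuple[str, ...]]:
    """Return one five-phoneme window (P1..P5) per input phoneme.

    The phoneme at position i of the input is P3 of window i. Two padding
    symbols are added at each end so that the first and last phonemes
    also have a full window (an assumption of this sketch).
    """
    padded = [pad, pad] + list(phonemes) + [pad, pad]
    return [tuple(padded[i:i + 5]) for i in range(len(phonemes))]


if __name__ == "__main__":
    for window in context_windows(["h", "@", "l", "@U"]):
        print(window)
```

Padding with silence is one possible way of handling the ends of a segment; other treatments would serve equally well for the purpose of illustration.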

The prior art includes techniques in which variable
length strings are converted into digital waveform. However,
the context of the selected strings is not taken into
account. Each phoneme comprised in a selected string is, of
course, in context with all the other phonemes of the string
but the context of the string as a whole is not taken into
account. This invention not only takes into account the
contexts within the selected string but it also selects a
best matching string from the strings available in the
database. This specification will now describe important
integers of the preferred embodiment, namely:

(i) the definition of "best" as used in the
selections;

(ii) the configuration of the database which stores
the signal representations of the data context
windows together with their corresponding
digital waveforms;

(iii) the method of selection for (ii) using (i);
and

(iv) picking one of the various alternatives
provided by (iii).

DEFINITION OF "BEST"

This invention selects from alternative context
windows on the basis of a "best" match between the input
context window and the various stored context windows. Since
there are many, e.g. 10^3 or 10^10, possible context windows (of
5 phonemes each) it is not possible to store all of them,
i.e. the database will lack some of the possible context
windows. If all possible context windows were stored it
would not be necessary to define a "best" match since an
exact correspondence would always be available. However,
each individual phoneme should be included in the database
and it is always possible to achieve an exact match for at
least one phoneme. In the preferred embodiment it is always
possible to match exactly P3 of the data context window with
P3 of the stored context window but, in general, further
exact matches may not be possible.

This invention defines a correlation parameter between
two phonemes as follows. Corresponding to each phoneme there
is a type-vector which consists of an ordered list of
co-efficients. Each of these co-efficients represents a feature
of its phoneme, e.g. whether its phoneme is voiced or
unvoiced or whether or not its phoneme is a sibilant, a
plosive or a labial. It is also desirable to include
locational features, e.g. whether or not the phoneme is in a
stressed or unstressed syllable. Thus, the type-vector
uniquely characterises its phoneme and two phonemes can be
compared by comparing their type-vectors co-efficient by
co-efficient, e.g. by using an exclusive-or gate (which is
sometimes called an equivalence gate). The number of
matchings is one way of defining the correlation parameter.
If desired this can be converted to a percentage by dividing
by the maximum possible value of the parameter and
multiplying by 100.

(As an alternative, a mis-match parameter can be
defined, e.g. by counting the number of discrepancies in the
two type-vectors. It will be appreciated that selecting a
"best" match is equivalent to selecting a lowest mis-match.)

The primary definition relates to the correlation
parameter of a pair of phonemes. The correlation parameter
of a string is obtained by summing or averaging the
parameters of the corresponding pairs in the two strings.
Weighted averages can be utilised where appropriate.
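By way of illustration only, the correlation parameter described above might be computed as in the following Python sketch. The five-feature ordering (voiced, sibilant, plosive, labial, stressed) and the example type-vectors are assumptions made for the example; the specification does not fix the length or order of the type-vector.

```python
from typing import Optional, Sequence

# Hypothetical feature order: (voiced, sibilant, plosive, labial, stressed)
S = (0, 1, 0, 0, 0)
Z = (1, 1, 0, 0, 0)
B = (1, 0, 1, 1, 0)


def phoneme_correlation(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of co-efficients on which two type-vectors agree."""
    if len(a) != len(b):
        raise ValueError("type-vectors must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


def phoneme_mismatch(a: Sequence[int], b: Sequence[int]) -> int:
    """Alternative measure: number of discrepancies (bitwise exclusive-or count)."""
    if len(a) != len(b):
        raise ValueError("type-vectors must have the same length")
    return sum(x ^ y for x, y in zip(a, b))


def as_percentage(score: float, vector_length: int) -> float:
    """Divide by the maximum possible value and multiply by 100."""
    return 100.0 * score / vector_length


def string_correlation(
    vectors_a: Sequence[Sequence[int]],
    vectors_b: Sequence[Sequence[int]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """(Weighted) average of the pairwise correlation parameters of two strings."""
    scores = [phoneme_correlation(a, b) for a, b in zip(vectors_a, vectors_b)]
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


if __name__ == "__main__":
    print(phoneme_correlation(S, Z))        # agreement count for "s" and "z"
    print(as_percentage(phoneme_correlation(S, Z), len(S)))
    print(string_correlation([S, B], [Z, B]))
```

For binary co-efficients the match count and the mis-match count always sum to the vector length, which is why maximising one is equivalent to minimising the other.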

In the preferred embodiment, the database is based on
an extended passage of the selected language, e.g. English
(although the information content of the passage is not
important). A suitable passage lasts about two or three
minutes and it contains about 1000-1500 phonemes. The
precise nature of the extended passage is not particularly
important although it must contain every phoneme and it
should contain every phoneme in a variety of contexts.

The extended passage can be stored in two different
formats. First the extended passage can be expressed in
phonemes to provide the access section of a linked database.
More specifically, the phonemes representing the extended
passage are divided into context windows each of which
contains 5 phonemes. The method of the invention comprises
obtaining best matches for the data context windows with the
stored context windows just identified.

The extended passage can also be provided in the form
of a digitised waveform. As would be expected, this is
achieved by having a reader or reciter speak the extended
passage into a microphone so as to make a digital recording
using well established technology. Any point in the digital
recording can be defined by a parameter, e.g. by the time
from the start. Analysing the recording establishes values
for the time-parameter corresponding to the break between
each pair of phonemes in the equivalent text. This
arrangement permits phoneme-to-waveform conversion for any
included string by establishing the starting value of the
time-parameter corresponding to the first phoneme of the
string and the finishing value for the time-parameter
corresponding to the last phoneme of the string and
retrieving the equivalent portion of the database, i.e. the
specified digital waveform. Specifically, a conversion for
any string of one, two or three phonemes can be achieved.

The important requirement is to select the best
portion of the extended text for the conversion.
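The retrieval of a waveform portion from the time-parameter can be sketched as below. The representation is an assumption made for illustration: the recording is taken to be a sequence of samples, `boundaries[i]` is taken to be the time in seconds of the break immediately before phoneme `i` of the extended passage (so there is one more boundary than there are phonemes), and the name `string_to_waveform` is hypothetical.

```python
from typing import List, Sequence


def string_to_waveform(
    samples: Sequence[int],
    boundaries: Sequence[float],
    first: int,
    last: int,
    sample_rate: int,
) -> List[int]:
    """Return the recorded samples for passage phonemes first..last inclusive.

    The starting value of the time-parameter is the break before the first
    phoneme; the finishing value is the break after the last phoneme.
    """
    if not 0 <= first <= last < len(boundaries) - 1:
        raise IndexError("phoneme indices outside the extended passage")
    start = int(round(boundaries[first] * sample_rate))
    end = int(round(boundaries[last + 1] * sample_rate))
    return list(samples[start:end])


if __name__ == "__main__":
    recording = list(range(100))            # stand-in for a digitised waveform
    breaks = [0.0, 2.0, 3.5, 6.0, 10.0]     # four phonemes, five breaks
    pair = string_to_waveform(recording, breaks, first=1, last=2, sample_rate=10)
    print(len(pair), pair[0], pair[-1])
```

Storing only the break times keeps the access portion of the database small, since the waveform itself is held once and every string is simply a pair of indices into it.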

It has already been mentioned that the phoneme version
of the extended text is stored in the form of context windows
each of five phonemes. This is most suitably achieved by
storing the phonemes in a tree which has three hierarchical
levels.

The first level of the hierarchy is defined by phoneme
P3 of each window. The effect is that every phoneme gives
direct access to a subset of the context windows, i.e. the
totality of context windows is divided into subsets and all
the windows of a subset have the same value of P3.

The next level of the tree is defined by phonemes P2
and P4 and, since this selection is made from the subsets
defined above, the effect is that the totality of context
windows is further divided into smaller subsets each of which
is defined by having phonemes P2, P3 and P4 in common.
(There are approximately half a million subsets but most of
them will be empty because the relevant sequence P2, P3, P4
does not occur in the extended text.) Empty subsets are not
recorded at all so that the database remains of manageable
size. Nevertheless it is true that for each triple sequence
P2, P3, P4 which occurs in the extended text there will be a
subset recorded in the second level of the database under P2,
P4, which level will also have been indexed at the first level
under P3.

Finally the second level gives access to a third level
which contains subsets having P2, P3 and P4 as exact matches
and it contains all the values of P1 and P5 corresponding to
these triples. Best matches for data P1 and P5 are selected.
This selection completely identifies one of the context
windows contained in the extended text and it provides access
to the time-parameters of said window. Specifically it provides
start and finish time-parameters for up to four different
strings as follows:

(a) P3 by itself;

(b) the pair of phonemes P2 + P3;

(c) the pair of phonemes P3 + P4; and

(d) the triple consisting of the phonemes P2 + P3
+ P4.

In the first instance, the database provides beginning
and ending values of the time-parameter corresponding to each
one of the selected strings (a) - (d). As explained above,
the time-parameter defines the relevant portion of a digital
waveform so that the equivalent waveform is selected.
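The three-level tree can be pictured with the following Python sketch, which is one possible arrangement and not necessarily the one used in the preferred embodiment. Nested dictionaries stand in for the tree; the function name `build_tree` and the use of the passage position of P3 as the link to the time-parameters are assumptions of the sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# level 1 key: P3; level 2 key: (P2, P4); level 3 entries: (P1, P5, position of P3)
Tree = Dict[str, Dict[Tuple[str, str], List[Tuple[str, str, int]]]]


def build_tree(passage: List[str]) -> Tree:
    """Index every five-phoneme window of the extended passage.

    Only windows that actually occur are stored, so empty subsets take
    no space. The stored position identifies the window in the passage
    and hence its time-parameters.
    """
    tree = defaultdict(lambda: defaultdict(list))
    for i in range(2, len(passage) - 2):
        p1, p2, p3, p4, p5 = passage[i - 2:i + 3]
        tree[p3][(p2, p4)].append((p1, p5, i))
    # Convert to plain dicts so that later look-ups cannot create empty subsets.
    return {p3: dict(level2) for p3, level2 in tree.items()}


if __name__ == "__main__":
    tree = build_tree(["a", "b", "c", "a", "b", "d", "a"])
    print(tree["c"])   # windows whose P3 is "c", keyed by (P2, P4)
```

With the position of P3 known, the start and finish values for strings (a) to (d) follow from the phoneme breaks on either side of P2, P3 and P4.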
It should be noted that item (d) will be offered if it
is contained in the database; in this case items (a), (b)
and (c) are all embedded in the selected (d) and they are,
therefore, available as alternatives. If item (d) is not
contained in the database then, clearly, this option cannot
be offered.

Even if item (d) is missing from the database,
items (b) and/or (c) may still be present in the database.
When both of these options are offered they will usually
arise from different parts of the database because item (d)
is missing. Therefore, depending on the content of the
database, the selection will offer (b) alone, or (c) alone,
or both (b) and (c). Thus the selection may provide a choice
and in any case item (a) is available because it is embedded
in the pair.

Finally, even if (b), (c) and (d) are all absent from
the database, item (a) will always be present and thus a "best
match" will be offered for the single phoneme and this will
be the only possibility which is offered.

It will be apparent that items (b), (c) and (d) imply
that strings will overlap. Thus whenever item (c) is
selected for any phoneme then item (b) must be available for
the next phoneme. If nothing better is offered, then the same
part of the database will meet the requirements of (c) for
the earlier phoneme and (b) for the later but, because
different correlations are involved, better choices may be
selected. It will also be apparent that whenever item (d) is
available item (c) will be available for the previous phoneme
and, in addition, item (b) will be available for the
following phoneme. In other words, some of the strings will
overlap, i.e. there will be alternatives for some phonemes such
that the same phoneme occurs in different places in different
strings. This aspect of the invention is described in
greater detail below.

It has been emphasised that the preferred embodiment
is based on a context window which is five phonemes long.
However the full string of five phonemes is never selected.
Even if, fortuitously, the input text contains a string of
five found in the database only the triple string P2, P3, P4
will be used. This emphasises that the important feature of
the invention is the selection of a string from a context
and, therefore, the invention selects the "best" context
window of five phonemes and only uses a portion thereof in
order to ensure that all selected strings are based upon a
context.

SELECTION OF "BEST" WINDOW

The analysis of the text into phonemes contained in
the database is carried out phoneme by phoneme, but each
phoneme is utilised in its context window. The next part of
the description will be based upon the selection procedure
for one of the data phonemes, it being understood that the
same procedure is used for each of the data phonemes.



The selected data phoneme is not utilised in isolation
but as part of its context window. More precisely the
selected data phoneme becomes phoneme P3 of a data window
with its two predecessors and two successors being selected
to provide the five phonemes of the relevant context window.
The database described above is searched for this context
window; since it is unlikely that the exact window will be
located, the search is for the best fitting of the stored
context windows.

The first step of the search involves accessing the
tree described above using phoneme P3 as the indexing
element. As explained above this gives immediate access to
a subset of the stored context windows. More specifically,
accessing level one by phoneme P3 gives access to a list of
phoneme pairs which correspond to possible values of P2 and
P4 of the data context window. The best pair is selected
according to the following four criteria.

First criterion. Fortuitously, it may happen that one
pair in the sub-set gives an exact match for data P2 and P4.
When this happens that pair is selected and the search
immediately proceeds to level 3. This outcome is unlikely
because, as explained in greater detail above, the string P2,
P3, P4 may not be contained in the extended passage.

Second criterion. In the absence of a triple match a
left-hand pair will be selected if it occurs. The left-hand match
is selected when an exact match for P2 is found and, if
alternatives offer, the P4 which has the highest correlation
parameter will be selected to give access to level 3 of the
tree.

Third criterion. This is similar to the second except
that it is a right-hand pair depending upon an exact match
being discovered for P4. In this case access to level 3 is
given by the P2 value which provides the highest correlation
parameter.

Fourth criterion. This applies when there is no match for
either P2 or P4, in which case the pair P2, P4 with the
highest average correlation parameter is selected as the
basis of access to level 3.

It will be noted that if criterion 1 succeeds, then it
will be possible to take as alternatives a left-hand pair, a
right-hand pair and a single value in accordance with
criteria 2, 3 and 4.

Even if criterion 1 fails, it is still possible that
a left-hand pair will be found by criterion 2 and it is even
possible that, simultaneously, a right-hand pair will be
found by criterion 3. However, because criterion 1 has failed,
they will be selected from different parts of the database
and they will give access to different parts of the tree at
level 3.

Finally criterion 4 will only be accepted when
criteria 1, 2 and 3 have all failed and it follows that the
phoneme P3 cannot be found in triples or pairings when used
in other context windows.

Thus, when criterion 1 or 4 is utilised there will
only be access to one portion of the tree at the third level
but it is possible, when criteria 2 and 3 are used, that
there will be access to two different parts of the third
level.
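The four criteria can be summarised in the following Python sketch. It is a simplified illustration under stated assumptions: `select_pairs` is a hypothetical name, `stored_pairs` stands for the (P2, P4) pairs recorded at level two under the data P3, and `toy_corr` is a stand-in correlation parameter (vowel or consonant agreement) used only so that the example runs; a real system would compare type-vectors.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]
VOWELS = set("aeiou")  # toy classification, used only by toy_corr


def toy_corr(x: str, y: str) -> float:
    """Stand-in correlation: 2 for identity, 1 for same class, else 0."""
    if x == y:
        return 2.0
    return 1.0 if (x in VOWELS) == (y in VOWELS) else 0.0


def select_pairs(
    data_p2: str,
    data_p4: str,
    stored_pairs: List[Pair],
    corr: Callable[[str, str], float],
) -> Dict[str, Pair]:
    """Return the stored (P2, P4) pairs giving access to level 3."""
    if (data_p2, data_p4) in stored_pairs:                  # criterion 1
        return {"triple": (data_p2, data_p4)}
    chosen: Dict[str, Pair] = {}
    left = [p for p in stored_pairs if p[0] == data_p2]
    if left:                                                # criterion 2
        chosen["left"] = max(left, key=lambda p: corr(p[1], data_p4))
    right = [p for p in stored_pairs if p[1] == data_p4]
    if right:                                               # criterion 3
        chosen["right"] = max(right, key=lambda p: corr(p[0], data_p2))
    if not chosen:                                          # criterion 4
        chosen["single"] = max(
            stored_pairs,
            key=lambda p: (corr(p[0], data_p2) + corr(p[1], data_p4)) / 2,
        )
    return chosen


if __name__ == "__main__":
    stored = [("a", "t"), ("a", "e"), ("k", "t"), ("o", "s")]
    print(select_pairs("a", "t", stored, toy_corr))
    print(select_pairs("a", "s", stored, toy_corr))
    print(select_pairs("p", "d", stored, toy_corr))
```

The result has one entry for criteria 1 and 4 and up to two entries when criteria 2 and 3 both operate, mirroring the one or two portions of the third level that are then accessed.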
We have now described how the selection of a context
window gives rise to either one or two areas of the third
level of the tree. In each case the third level may contain
several pairings for phonemes P1 and P5 of the data context
window. The pair with the best average correlation parameter
is selected as the context window in the access portion of
the database. As explained above this context window is
converted to digital waveform using the time-parameter.
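The third-level choice can be sketched in the same spirit. In this Python illustration each candidate is assumed to be a tuple (P1, P5, position), where the position locates the window in the extended passage; the names `select_window` and `toy_corr` are hypothetical and the correlation used is a toy stand-in.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str, int]   # (P1, P5, position of P3 in the passage)
VOWELS = set("aeiou")              # toy classification, used only by toy_corr


def toy_corr(x: str, y: str) -> float:
    """Stand-in correlation: 2 for identity, 1 for same class, else 0."""
    if x == y:
        return 2.0
    return 1.0 if (x in VOWELS) == (y in VOWELS) else 0.0


def select_window(
    data_p1: str,
    data_p5: str,
    candidates: List[Candidate],
    corr: Callable[[str, str], float],
) -> Candidate:
    """Pick the stored window whose P1 and P5 best match the data P1 and P5."""
    return max(
        candidates,
        key=lambda c: (corr(c[0], data_p1) + corr(c[1], data_p5)) / 2,
    )


if __name__ == "__main__":
    print(select_window("o", "s", [("a", "t", 3), ("k", "e", 9)], toy_corr))
```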
To re-emphasise, where criterion 1 is used only one
context window is selected but it gives rise to four
possibilities, namely time-parameter ranges for:

(i) the triple P2 + P3 + P4;

(ii) the left-hand pair P2 + P3;

(iii) the right-hand pair P3 + P4; and

(iv) the single P3 by itself.

When criterion 2 operates, this provides time-parameter
ranges only for the left-hand pair P2 + P3 and for
a single P3 by itself. When criterion 3 operates similar
considerations apply but the parameter ranges are for the
right-hand pair P3 + P4 and for the single P3. If both
criteria operate this offers two choices for the single P3
and only the one with the higher correlation parameter for P1
+ P5 is selected.

Finally, when criterion 4 operates there is only one
possibility, namely the phoneme P3 by itself.

The description given above explains how conversions
are provided for each phoneme of an input text. Sometimes
the method provides a conversion for only a single phoneme
and, in this case, no alternatives are offered. In some
cases the method provides conversion for strings of two or
three adjacent phonemes and, in these circumstances, the
conversion provides alternatives for at least one phoneme.
In order to complete the selection, it is necessary to reduce
the number of alternatives to one. The preferred method of
achieving this reduction will now be explained.

The preferred method of making the reduction is
carried out by processing a short segment of input text, e.g.
a segment which begins and ends with a silence. Provided it
is not too long, a sentence constitutes a suitable segment.
If a sentence is very long, e.g. more than thirty words, it
usually contains one or more embedded silences, e.g. between
clauses or other sub-units. In the case of long sentences
such sub-units are suitable for use as the segments.

The processing of a segment to reduce each set of
alternatives to one will now be described. As mentioned, no
alternative will be offered for some of the phonemes and,
therefore, no selection is required for these phonemes.
Alternatives will be available for the other phonemes and the
selection is made so as to produce a "best" result for the
segment as a whole. This may involve making a locally "less
good" selection at one point in the segment in order to
obtain a "better" selection elsewhere in the segment. The
criteria of "better" include:

(i) taking longer strings rather than shorter
strings, and

(ii) selecting from strings which overlap rather
than from strings which merely abut.

The rejection of unwanted alternatives produces a
position in which each phoneme has one, and only one,
conversion. In other words the input text will have been
divided into sub-strings of 1, 2 or 3 phonemes matching the
database and the beginning and ending values for the selected
strings will therefore be established. The output portion
of the database takes the form of a digitised waveform and
the parameters which have been established define segments of
this waveform. Therefore the designated segments are
selected and abutted to produce the digital waveform
corresponding to the input text. This completes the
requirement of the invention.
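One way of reducing the alternatives for a segment is sketched below in Python. It models only the first of the criteria above, namely a preference for longer strings, by covering the segment with as few strings as possible; the preference for overlapping strings is not modelled. The names `choose_strings` and `assemble`, and the representation of the offered conversions as a dictionary keyed by (start, length), are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

Key = Tuple[int, int]   # (index of first phoneme in the segment, string length 1-3)


def choose_strings(n: int, available: Dict[Key, Sequence[int]]) -> List[Key]:
    """Cover phonemes 0..n-1 with the fewest offered strings.

    Dynamic programming from the end of the segment allows a locally
    shorter string to be taken where that permits a longer one later.
    """
    inf = float("inf")
    best = [inf] * (n + 1)      # best[i] = fewest strings covering i..n-1
    best[n] = 0
    choice = [0] * n
    for i in range(n - 1, -1, -1):
        for length in (3, 2, 1):
            if i + length <= n and (i, length) in available:
                if best[i + length] + 1 < best[i]:
                    best[i] = best[i + length] + 1
                    choice[i] = length
    if best[0] == inf:
        raise ValueError("some phoneme has no conversion at all")
    cover, i = [], 0
    while i < n:
        cover.append((i, choice[i]))
        i += choice[i]
    return cover


def assemble(cover: List[Key], available: Dict[Key, Sequence[int]]) -> List[int]:
    """Abut the selected waveform segments in order."""
    output: List[int] = []
    for key in cover:
        output.extend(available[key])
    return output


if __name__ == "__main__":
    offered = {(0, 1): [0], (1, 1): [1], (2, 1): [2], (3, 1): [3],
               (0, 2): [10, 11], (1, 3): [20, 21, 22]}
    cover = choose_strings(4, offered)
    print(cover, assemble(cover, offered))
```

In the example a pair is offered at the first phoneme, but taking the single phoneme there allows the triple to be used for the remainder, which is the kind of locally "less good" selection referred to above.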

Having obtained a digital waveform this can be
provided as audible output using conventional digital to
analogue conversion techniques and conventional loudspeakers.
If desired, the primary digital waveform can be enhanced
using techniques known to those skilled in the art.

The invention will be further described by way of
example with reference to the accompanying drawings in
which:

Figure 1 illustrates diagrammatically a speech engine
in accordance with the invention; and

Figure 2 shows a speech engine as illustrated in
Figure 1 attached to a telephone network.

As shown in Figure 1 the speech engine according to
the invention comprises a primary processor 11 which is adapted
to accept text in graphemes and to produce therefrom an
equivalent text in phonemes. This text is passed to
converter 12 which is operatively associated with a database
13 in accordance with the invention. Converter 12 matches
segments of the phoneme text with segments stored in the
access portion of database 13. Thus segments of digital
waveform are retrieved and these are assembled into extended
portions of digital waveform corresponding to extended
portions of the original input.

These extended portions of digital waveform are passed
to waveform processor 14 where they are subjected to further
processing in order to produce a smooth output. Finally the
digital output is converted into an analogue waveform which
is provided at output port 15 for onward transmission.

As shown in Figure 1 the speech engine is connected to
receive its input from an external database 16 which holds
texts in conventional orthography. External database 16 is
conveniently operated by keyboard 17 to select a text stored
in database 16. This text is provided to the primary
processor 11 and it appears at the output port 15 as an
analogue waveform.

Figure 2 shows a speech engine as illustrated in
Figure 1 attached to a public access telephone network. As
shown in Figure 2, a conventional speech telephone 20 is
connected to a station 22 via a switched access network 21.
Station 22 includes a speech engine as shown in Figure 1 and
the output port 15 is connected to the network so that the
information available in the external database 16 can be
provided, as an analogue acoustic waveform, to the telephone
20.

If desired the keypad (used for dialling) of the
telephone 20 can be used as the keyboard 17 of the external
database 16 (in which case the external database 16
preferably contains instructions which can be read by the
speech engine). A simpler technical arrangement provides a
human operator at the station 22 and the human operator
actuates the keyboard 17 in accordance with instructions
received over the network 21. When the operator has selected
a portion of text this is read by the speech engine and
further participation by the operator is unnecessary. Thus
the operator is freed to assist with further enquiries and
the use of a speech engine enhances the efficiency of the
operation.

It will be appreciated that there are many other
applications for a speech engine according to the invention,
e.g. it is suitable for connection to a public address
system.



